Some protein interaction data do not exhibit power law statistics 
Abstract 

It has been claimed that protein-protein interaction (PPI) 
networks are scale-free based on the observation that the node 
degree sequence follows a power law. Here we argue that 
these claims are likely to be based on erroneous statistical 
analysis. Typically, the supporting data are presented using 
frequency-degree plots. We show that such plots can be 
misleading, and should correctly be replaced by rank-degree 
plots. We provide two PPI network examples in which the 
frequency-degree plots appear linear on a log-log scale, but the 
rank-degree plots demonstrate that the node degree sequence 
is far from a power law. We conclude that at least these PPI 
networks are not scale-free. 

Keywords: Protein-protein interaction (PPI) networks, node 
degree sequence, power law, rank-degree plot. 
1 Introduction 

Experimental data on protein-protein interaction (PPI) networks
have been extensively gathered with the aim of acquiring
a system-level understanding of biological processes [1, 2]. 
Various statistical features of complex graphical structures 
have received attention, including the size of the largest con- 
nected component, the node degree distribution, the graph di- 
ameter, the characteristic path length, and the clustering coeffi- 
cient. However, the feature that has attracted the most attention 
is the distribution of node degree (the number of links from 
a node) and whether or not the distribution follows a power 
law (linear plot on log-log scale). The degree distribution of 
PPI networks was claimed to follow a power law in [2], and 
thus PPI networks are considered to be "scale-free" (SF) [3], 
a generic property of network topologies common to various 
networks in different domains, from social networks and bio- 
logical systems to the Internet. 

Although "scale-free" has not been clearly defined in the ex- 
isting literature [4], most treatments assume that a power law 
node degree distribution is an important, and sometimes defin- 
ing feature. Other characteristics described in the SF literature 
include failure tolerance but attack vulnerability at hubs (nodes 
possessing high degree) and various kinds of self-similarity. A 
recent attempt at a more theoretically rigorous treatment [5] 
shows, however, that these additional features do not follow from a
power law node degree sequence alone, but require additional
restrictions, such as high likelihood of occurrence by random
generation (e.g. by preferential attachment). Other work [6, 7] 
has highlighted important differences between PPI networks 
and "SF networks" constructed by a stochastic growth model. 
Moreover, one may question the rigor with which the power 
law node degree distribution, the primary feature of SF net- 
works, has been demonstrated in certain examples. 

This letter shows that the node degree sequences of some 
published PPI networks are better described by an exponential 
function when properly plotted and analyzed. The problem 
with previous work is that data were plotted using frequency- 
degree plots, as is common in papers purporting to discover 
power laws in complex biological systems, which lead to sys- 
tematic errors compared with rank-degree plots. We demon- 
strate here that data plotted on a loglog scale frequency-degree 
plot may appear to be linear, but when the same data are plot- 
ted on a loglog scale rank-degree plot, they are clearly shown 
not to be power law. Thus, the data for some PPI networks 
lack even the minimal features of scale-free networks. 

2 Materials and Methods 

Publicly available data for PPI networks represent only an 
approximation of the real interaction network because of the 
large number of false positive and false negative interactions. 
However, because of the assumed self-similarity features of 
SF networks, it has been claimed that if the real PPI network is 
SF, then any appropriately sampled subnetwork is also SF [8]. 
Thus, we might still gain valuable information by examin- 
ing whether the publicly available PPI network data possess 
a power law node degree distribution characteristic of SF net- 
works. 

A finite sequence of node degrees $y = (y_1, y_2, \ldots, y_n)$ of 
integers, assumed without loss of generality always to be ordered
such that $y_1 \ge y_2 \ge \ldots \ge y_n$, is said to follow a power 
law if 

$$k \approx c\,y_k^{-\alpha}, \tag{1}$$

where $k$ is (by definition) the rank of $y_k$, $c > 0$ is a constant, 
and $\alpha > 0$ is called the scaling index. Because of the ordering, 
the rank $k$ is the number of nodes with the degree equal or 
larger than $y_k$. Since $\log k = \log c - \alpha \log y_k$, the rank $k$ 
versus the node degree $y_k$ plot on a loglog scale appears as a 
straight line of slope $-\alpha$. In contrast, $y$ is said to follow an 
exponential if 

$$k \approx a \exp(-b\,y_k), \tag{2}$$

where $a > 0$ and $b > 0$ are constants. The $k$ versus $y_k$ plot 
on a semilog scale approximates a straight line of slope $-b$ 
since $\log k = \log a - b\,y_k$. 
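
As an illustration of the rank-degree construction, the following minimal sketch computes, for each distinct degree, the number of nodes with degree equal to or larger than it. The function name `rank_degree` and the toy degree sequence are hypothetical and are not taken from the FYI or HPI data; for tied degrees the sketch assumes the counting convention stated in the text.

```python
from collections import Counter


def rank_degree(degrees):
    """Return (degree, rank) pairs, largest degree first.

    The rank of a degree value is the number of nodes whose degree is
    equal to or larger than that value.
    """
    counts = Counter(degrees)
    pairs = []
    rank = 0
    for degree in sorted(counts, reverse=True):
        rank += counts[degree]  # cumulative count of nodes with degree >= this value
        pairs.append((degree, rank))
    return pairs


# Hypothetical toy degree sequence, for illustration only.
toy_degrees = [5, 3, 3, 2, 1, 1, 1]
print(rank_degree(toy_degrees))  # [(5, 1), (3, 3), (2, 4), (1, 7)]
```

Plotting rank against degree on loglog and semilog axes then gives the two views used in this letter, with no binning or smoothing of the raw data.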

Note that the rank-degree relationships (1) and (2) are
non-stochastic, in the sense that there need be no assumption of 
an underlying probability distribution for the sequence $y$.
Indeed, no coherent justification has been given for why
biological networks should be viewed as samples from a random 
ensemble. On the contrary, what is known of evolution would 
suggest that it yields extremely nonrandom structure at every 
level of organization. Nevertheless, random graphs have been 
remarkably popular models for biological networks, but have 
led to substantial confusion, particularly with regard to power 
laws. Suppose a non-negative random variable $X$ has cumulative
distribution function (CDF) $F(x) = P[X \le x]$. In this 
stochastic context, a random variable $X$ or its corresponding 
distribution function $F$ is said to follow a power law with
index $\alpha > 0$ if, as $x \to \infty$, 

$$P[X > x] = 1 - F(x) \approx c\,x^{-\alpha}, \tag{3}$$

for some constant $c > 0$ and a tail index $\alpha > 0$, where 
$f(x) \approx g(x)$ as $x \to \infty$ if $f(x)/g(x) \to 1$ as $x \to \infty$. 
We call (3) the stochastic form of the power law rank-degree
relationship. The loglog plot of $P[X > x]$ versus $x$ appears as 
a straight line of slope $-\alpha$ for large $x$. If the CDF $F(x)$
satisfying (3) is differentiable, then its derivative, the probability 
density function $f(x) = \frac{d}{dx}F(x)$, satisfies 

$$f(x) \approx c'\,x^{-(1+\alpha)}. \tag{4}$$

The loglog plot of $f(x)$ versus $x$ also would be a line of slope 
$-(1 + \alpha)$. In contrast to the rank-degree relationships (1) and 
(2), the definitions in (3) and (4) are stochastic and require 
an underlying probability model. As is standard in physics, 
the SF literature almost exclusively assumes some underlying 
stochastic models, and power law node degree distributions 
are typically investigated in terms of the frequency-degree re- 
lationship based on the probability density function f(x). 
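
The step from the stochastic rank form (3) to the density form (4) can be sketched as follows. This is an illustrative derivation only, under the simplifying assumption that the tail relation in (3) holds exactly for large $x$; the identification $c' = c\alpha$ follows from that assumption and is not stated in the text above.

```latex
\begin{align*}
1 - F(x) &= c\,x^{-\alpha}
  && \text{(assumed exact for large } x\text{)} \\
f(x) = \frac{d}{dx}F(x)
  &= -\frac{d}{dx}\bigl(c\,x^{-\alpha}\bigr)
   = c\,\alpha\,x^{-(1+\alpha)}
  && \text{so } c' = c\,\alpha \\
\log f(x) &= \log(c\,\alpha) - (1+\alpha)\log x
  && \text{(slope } -(1+\alpha) \text{ on a loglog plot)}
\end{align*}
```

The derivation relies on $F$ being differentiable, which is exactly the property that fails for discrete node degree data.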

In the case of node degree of graphs the data is inherently 
discrete. Even if the data were sampled from some ensemble, 
$F(x)$ is not differentiable and the frequency-degree plots simply
do not make sense and can easily lead to mistakes. Furthermore,
differentiation of noisy data, such as PPI data, amplifies 
errors, making frequency-based data uninformative and am- 
biguous. A typical approach to overcome these problems is to 
smooth the data or to group individual data values into a small 
number of bins, and then plot the relative number of data val- 
ues in each bin. The problem is that this smoothing or binning 
process can dramatically change the nature of frequency-based 
statistics as will be shown below (Figs. 1 and 2). This use of 
ad hoc statistical analysis can lead to concluding incorrectly 
that a power law relationship is present (or absent). This problem
is easily avoided by making rank-degree plots of raw data, 
instead of frequency-degree plots, to check the 
power law or exponential relationships in (1) and (2). 
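
To make the dependence on binning concrete, the sketch below computes binned frequency-degree points for the same degree sequence under two different choices of bin edges. The function name `binned_frequency`, the bin edges, and the toy degrees are all hypothetical; this is not the procedure used in any of the cited studies, only an illustration of how an arbitrary binning choice changes the points that end up on a frequency-degree plot.

```python
def binned_frequency(degrees, edges):
    """Return (bin centre, normalised frequency) for half-open bins [lo, hi).

    The frequency in each bin is the count divided by the sample size and
    by the bin width, as is common when bins have unequal widths.
    """
    n = len(degrees)
    points = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        count = sum(1 for d in degrees if lo <= d < hi)
        points.append(((lo + hi) / 2, count / (n * (hi - lo))))
    return points


# Hypothetical toy degree sequence and two arbitrary binning choices.
toy_degrees = [1, 1, 1, 1, 2, 2, 3, 4, 6, 9]
fine = binned_frequency(toy_degrees, [1, 2, 4, 8, 16])
coarse = binned_frequency(toy_degrees, [1, 4, 16])
print(fine)    # four points
print(coarse)  # two points
```

The two binnings summarise the same ten degrees by different sets of points, so any line fitted to them depends on the binning, whereas a rank-degree plot of the raw data involves no such choice.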

From among many publicly available studies on PPI net- 
works, we used the filtered yeast interactome (FYI) data set 
[10] and the predicted human protein-interaction (HPI) map 
[11] to illustrate these points. Much of the original data suffers 
from numerous false positives and false negatives, but more 
recent investigations have sought to refine the data. For exam- 
ple, the FYI data set contains high-confidence interactions for 
yeast, each observed by at least two different methods, thereby 
enriching for genuine positives. The HPI map was generated 
using data from seven experimental and four computationally 
predicted protein-interaction maps from Saccharomyces
cerevisiae [12, 13, 14, 15, 16, 17], Drosophila melanogaster [18] 
and Caenorhabditis elegans [19]. The idea is that a human 
protein interaction can be predicted if orthologs in a model
organism show an interaction. Its accuracy has been assessed in 
[11]. We consider both FYI and HPI to be refined data sets, 
and investigate whether their node degree sequences follow a 
power law, a defining feature of scale-free networks, by rank- 
degree plots. 



3 Results and Discussion 



The rank-degree plots of the HPI and FYI data are shown in 
(a) loglog scale and (b) semilog scale in Figs. 1 and 2,
respectively. The straight lines and the dotted curve in loglog scale 
(a) show least-squares fitting of data to a power law with the 
value of its slope and to an exponential, respectively. The same 
fittings are depicted as the curve and the dotted straight line in 
semilog scale (b). From these figures, we can clearly conclude 
that the node degree sequences of HPI and FYI data are much 
closer to an exponential (2), and are clearly not power laws 
(1). More sophisticated statistical analysis can be used to
confirm these conclusions. In addition, the rank-degree plots show 
raw data and readers can easily judge at a glance the relative 
suitability of various models. 
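
A minimal sketch of such a comparison is given below. It assumes that both models are fitted by ordinary least squares in log space, i.e. $\log k$ against $\log y_k$ for the power law (1) and $\log k$ against $y_k$ for the exponential (2); the text does not specify the exact fitting procedure, so this is an assumption. The function names and the synthetic data are hypothetical and are not the FYI or HPI data.

```python
import math


def linear_fit(xs, ys):
    """Ordinary least squares for ys ~ intercept + slope * xs.

    Returns (slope, intercept, sum of squared residuals).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    return slope, intercept, sse


def compare_fits(degrees, ranks):
    """Fit the power law (1) and the exponential (2) to rank-degree data."""
    log_k = [math.log(k) for k in ranks]
    # Power law: log k = log c - alpha * log y, so the slope estimates -alpha.
    power = linear_fit([math.log(y) for y in degrees], log_k)
    # Exponential: log k = log a - b * y, so the slope estimates -b.
    expo = linear_fit(list(degrees), log_k)
    return {"power_law": power, "exponential": expo}


# Hypothetical synthetic data that follow k = 1000 * exp(-0.5 * y) exactly.
degrees = list(range(1, 11))
ranks = [1000 * math.exp(-0.5 * y) for y in degrees]
fits = compare_fits(degrees, ranks)
print(fits["exponential"][0])  # fitted slope, close to -0.5
print(fits["power_law"][2] > fits["exponential"][2])  # True
```

On this synthetic exponential sequence the semilog fit has essentially zero residual while the loglog fit does not, which is the kind of contrast the rank-degree plots make visible at a glance.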

However, using frequency-degree plots (c) in Figs. 1 and 2 
could lead to the erroneous conclusion that the node degree 
sequence appears to follow a power law, although the correct 
rank-degree plot clearly shows that this is not the case.
Furthermore, even if the PPI data were a power law, the slope
$-\beta$ for the frequency-degree plot is simply not related to the
slope $-\alpha$ for the rank-degree plot by $\beta = \alpha + 1$, as
holds for differentiable distributions. These results conclusively demonstrate 
that these two refined PPI data sets are not power laws, and 
thus certainly not scale-free, no matter how this is defined. 

It is in principle possible that the data studied here are
misleading and that real PPI networks might have some features
attributed to scale-free networks. At this time we can only draw 
conclusions about (noisy) subgraphs of the true network since 
the data sets are incomplete and presumably contain errors. 
However, the fact that these subgraphs exhibit exponential 
node degree sequences suggests that the entire network is not 
SF. Appropriately sampled subgraphs of an SF graph should be 
SF, and hence possess a power law node degree sequence.
Furthermore, an SF network possessing significant non-SF
subnetworks could not be considered to be self-similar, a typically
assumed though as yet unproven feature of scale-free networks. 
Finally, since essentially all claims that biological networks 
are scale-free are based on error-prone frequency-degree anal- 
ysis, this analysis must be completely redone to determine the 
correct form of the degree sequences. 

It has also been shown [20, 21] that the Internet and cell 
metabolism, the two most prominent examples of SF net- 
works, might have power laws for some degree sequences, 
but have none of the other features attributed to scale-free net- 
works. One important feature of the Internet and metabolic 
networks is the complete absence of centrally located high- 
degree hubs which are responsible for global network connec- 
tivity and whose removal would fragment the network, in con- 
trast to what has been claimed in the SF literature. Metabolic 
networks have also been shown to be scale-rich (SR), but not 
SF, in the sense that they are far from self-similar [21],
despite some power laws in certain node degree sequences. Their 
power law node degree sequence is a result of the mixture of 
exponential distributions in each functional module. In prin- 
ciple, PPI networks could have this SR structure as well, and 
perhaps power laws could emerge at higher levels of organiza- 
tion. This will be revealed only when a more complete network 
is elucidated. Still, the most important point is not whether the 
node degree sequence follows a power law, but whether the 
variability of the node degree sequences is high or low [21], 
and the biological protocols that necessitate this high or low 
variability. These issues will be explored in future publica- 
tions. 
Figure 1: Node degree distribution of the human
protein-interaction map [11]: (a) rank-degree plot in loglog scale, (b) 
rank-degree plot in semilog scale, and (c) frequency-degree 
plot in loglog scale. The rank-degree plots indicate that the 
degree distribution is exponential. The straight lines (a,c) and 
the dotted curve (a) in loglog scale are the least-squares fits 
of the data to the power law (with the value of the slope) and 
to the exponential distributions, respectively. The straight line 
and the dotted curve in loglog scale (a) become the curve and 
the dotted line in semilog scale (b). Still, the frequency-degree 
plot in (c) might appear visually to follow a power law, and 
can lead to potential errors of finding power law node degree 
distribution. 



Figure 2: Node degree distribution of the 'filtered yeast
interactome' (FYI) data set [10]: (a) rank-degree plot in loglog scale, 
(b) rank-degree plot in semilog scale, and (c) frequency-degree 
plot in loglog scale. The rank-degree plot (a,b) shows the non- 
power law distribution, which is not evident in the frequency- 
degree plot (c). The straight lines (a,c) and the dotted curve 
(a) in loglog scale are the least-squares fits of the data to the 
power law (with the value of the slope) and to the exponen- 
tial distributions, respectively. The straight line and the dotted 
curve in loglog scale (a) become the curve and the dotted line 
in semilog scale (b). 



